Backport of Use strict DNS for mesh gateways with hostnames into release/1.16.x #19395

hc-github-team-consul-core · 2023-10-26T20:08:00Z

Backport

This PR is auto-generated from #19268 to be assessed for backporting due to the inclusion of the label backport/1.16.

🚨 Warning: automatic cherry-pick of commits failed. If the first commit failed,
you will see a blank no-op commit below. If at least one commit succeeded, you
will see the cherry-picked commits up to, not including, the commit where
the merge conflict occurred.

The person who merged in the original PR is:
@andrewstucki
This person should manually cherry-pick the original PR into a new backport PR,
and close this one when the manual backport PR is merged in.

merge conflict error: POST https://api.github.com/repos/hashicorp/consul/merges: 409 Merge conflict []

The below text is copied from the body of the original PR.

Description

This fixes #17557. In an attempt to support mesh gateways fronted by AWS load balancers, #14917 introduced a code path for peered mesh gateways that uses Envoy clusters backed by the LOGICAL_DNS cluster discovery type. The problem is that when a peered connection has multiple mesh gateway replicas, the dialing peer hits this code path and attempts to add multiple endpoints for the targeted mesh gateways. Envoy does not support multiple endpoints with LOGICAL_DNS, so it rejects the xDS configuration it receives from Consul and logs errors.

In Kubernetes, when a mesh gateway then restarts, it never finishes initializing and is never marked healthy, so its pod restarts continually and the gateway becomes unusable.

Because the bug only occurs when hostnames rather than IP addresses are registered as the WAN addresses of mesh gateways, it likely affects mostly Consul users on AWS, where LoadBalancer services are given hostnames that are then registered for LoadBalancer-type mesh gateways. It also affects users who manually (or with annotations) register mesh gateways with multiple FQDNs.

Note that this appears to affect only the dialing cluster in a peered connection. The accepting cluster uses a different code path that only ever uses a single mesh gateway target and doesn't attempt to load-balance between multiple mesh gateways.

Testing & Reproduction steps

I was able to recreate this pretty easily outside of AWS by pinning the FQDN of the mesh gateways in the accepting cluster via something like:

meshGateway:
  enabled: true
  replicas: 2
  wanAddress:
    source: "Static"
    static: "gateway.nanosleep.cloud"

which gives this for my dialing cluster:

curl https://${DC2_CONSUL}/v1/peerings ... | jq
...
"PeerServerAddresses": [
  "gateway.nanosleep.cloud:443",
  "gateway.nanosleep.cloud:443"
],
...

The dialing cluster's Consul logs then show:

2023-10-17T22:37:29.016Z [ERROR] agent.envoy.xds.mesh_gateway: got error response from envoy proxy: service_id=default/default/consul-consul-mesh-gateway-6ff745887b-5c5s2 typeUrl=type.googleapis.com/envoy.config.cluster.v3.Cluster xdsVersion=v3 nonce=00000006 error="rpc error: code = Internal desc = Error adding/updating cluster(s) server.dc1.peering.303380e1-f1a6-fb04-4ca6-c562e4951539.consul: LOGICAL_DNS clusters must have a single locality_lb_endpoint and a single lb_endpoint"

and the dialing cluster's mesh gateway logs show:

2023-10-17T22:36:20.838Z+00:00 [warning] envoy.config(14) gRPC config for type.googleapis.com/envoy.config.cluster.v3.Cluster rejected: Error adding/updating cluster(s) server.dc1.peering.303380e1-f1a6-fb04-4ca6-c562e4951539.consul: LOGICAL_DNS clusters must have a single locality_lb_endpoint and a single lb_endpoint

Swapping to STRICT_DNS allows the mesh gateway to finish configuration and boot properly.
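As a rough way to check whether a dialing cluster is exposed to this problem, the PeerServerAddresses list shown earlier can be inspected for more than one hostname-based entry. The Go sketch below does that with the standard library only. The helper names `hostnameAddrs` and `logicalDNSWouldBeRejected` are hypothetical and are not part of Consul; the check is an approximation of the condition described in this PR, not the logic Consul itself uses.

```go
package main

import (
	"fmt"
	"net"
)

// hostnameAddrs returns the entries of addrs (each "host:port") whose host
// is a DNS name rather than an IP literal. Malformed entries are skipped.
func hostnameAddrs(addrs []string) []string {
	var out []string
	for _, a := range addrs {
		host, _, err := net.SplitHostPort(a)
		if err != nil {
			continue // not a valid host:port pair
		}
		if net.ParseIP(host) == nil {
			out = append(out, a) // not an IP, so treat it as a hostname
		}
	}
	return out
}

// logicalDNSWouldBeRejected reports whether the address list would put more
// than one hostname endpoint into a single cluster, which is the situation
// that LOGICAL_DNS does not allow. Hypothetical helper for illustration.
func logicalDNSWouldBeRejected(addrs []string) bool {
	return len(hostnameAddrs(addrs)) > 1
}

func main() {
	peerServerAddresses := []string{
		"gateway.nanosleep.cloud:443",
		"gateway.nanosleep.cloud:443",
	}
	fmt.Println(logicalDNSWouldBeRejected(peerServerAddresses)) // prints "true"
}
```

Duplicate entries are counted separately on purpose, because the reproduction above produces the same hostname twice and still triggers the error.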

Links

Strict DNS in envoy.

PR Checklist

updated test coverage
external facing docs updated
appropriate backport labels added
not a security concern

Overview of commits

e9eabcb - 013de0b

hashicorp-cla · 2023-10-26T20:08:05Z

All committers have signed the CLA.

github-team-consul-core-pr-approver

Auto approved Consul Bot automated PR

hc-github-team-consul-core force-pushed the backport/net-4786/mesh-strict-dns/friendly-witty-dodo branch from 1603356 to 601b67d Compare October 26, 2023 20:08

hc-github-team-consul-core assigned andrewstucki Oct 26, 2023

hc-github-team-consul-core requested a review from andrewstucki October 26, 2023 20:08

github-team-consul-core-pr-approver approved these changes Oct 26, 2023

vercel bot temporarily deployed to Preview – consul October 26, 2023 20:12 Inactive

Use strict DNS for mesh gateways with hostnames

b8a02fe

andrewstucki force-pushed the backport/net-4786/mesh-strict-dns/friendly-witty-dodo branch from 3d0d19b to b8a02fe Compare October 27, 2023 15:08

andrewstucki marked this pull request as ready for review October 27, 2023 15:08

andrewstucki merged commit 0fd8cdb into release/1.16.x Oct 27, 2023
83 checks passed

andrewstucki deleted the backport/net-4786/mesh-strict-dns/friendly-witty-dodo branch October 27, 2023 16:30
